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Abstract 

Background: The construction of prediction intervals (Pis) for future body mass index (BMI) values of individual 
children based on a recent German birth cohort study with n = 2007 children is problematic for standard 
parametric approaches, as the BMI distribution in childhood is typically skewed depending on age. 

Methods: We avoid distributional assumptions by directly modelling the borders of Pis by additive quantile 
regression, estimated by boosting. We point out the concept of conditional coverage to prove the accuracy of Pis. 
As conditional coverage can hardly be evaluated in practical applications, we conduct a simulation study before 
fitting child- and covariate-specific Pis for future BMI values and BMI patterns for the present data. 

Results: The results of our simulation study suggest that Pis fitted by quantile boosting cover future observations 
with the predefined coverage probability and outperform the benchmark approach. For the prediction of future 
BMI values, quantile boosting automatically selects informative covariates and adapts to the age-specific skewness 
of the BMI distribution. The lengths of the estimated Pis are child-specific and increase, as expected, with the age 
of the child. 

Conclusions: Quantile boosting is a promising approach to construct Pis with correct conditional coverage in a 
non-parametric way. It is in particular suitable for the prediction of BMI patterns depending on covariates, since it 
provides an interpretable predictor structure, inherent variable selection properties and can even account for 
longitudinal data structures. 



Background 

Childhood obesity is more and more becoming a pro- 
blem of epidemic dimensions in modern societies [1,2]. 
The body mass index (BMI) has proved to be a reliable 
measure to assess childhood obesity and can also be 
seen as an indicator for obesity in adulthood [3,4]. 
Therefore, the prediction of future BMI values for indi- 
vidual children may be used as a warning bell for clini- 
cians, parents and children. Predicting future BMI 
values raises awareness for problems to come - as long 
as they are still avoidable - and can thus lower the risk 
of later obesity. 

In this setting, we focus on obtaining reliable predic- 
tions for future BMI values of children. Prediction 
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intervals (Pis) offer information on the expected varia- 
bility by providing not only a point prediction but a cov- 
ariate-specific interval which covers the future BMI for 
this individual child with high probability. We construct 
child-specific prediction intervals for the LISA study, a 
recent German birth cohort study with 2007 children 
[5]. Data include up to ten BMI values per child from 
birth until the age of 10, as well as variables that are dis- 
cussed to be potential early childhood risk factors for 
later obesity, such as breastfeeding, maternal BMI gain 
and smoking during pregnancy, parental overweight, 
socioeconomic factors, and weight gain during the first 
two years [6,7]. In our analysis, we first construct Pis 
for the children's BMI at approximately the age of four, 
relying on data available for the children at the age of 
two. In a second step, we explore the longitudinal struc- 
ture of the present data and construct Pis for child-spe- 
cific BMI patterns from two up to ten years. 
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Predicting child-specific BMI values is a great chal- 
lenge from two different perspectives: From the epide- 
miological perspective, it is difficult to predict BMI 
values as they depend on factors which are hard to mea- 
sure; such as physical activity, healthy nutrition, and life- 
style habits. From the statistical point of view, the 
distribution of BMI values is typically skewed and the 
degree of skewness depends on children's age, see e.g. 
Beyerlein et al. [8], which makes standard strategies to 
construct Pis relying on distributional and homoscedas- 
ticity assumptions problematic. 

In these standard parametric approaches, first, a point 
prediction for the future BMI value is estimated based 
on mean regression models with Gaussian distributed 
errors, then a symmetric PI is constructed around that 
point based on distributional assumptions. To predict 
BMI values, however, these standard parametric 
approaches are problematic due to two reasons: not 
only the model assumptions for the point prediction 
might not be fulfilled but also the length of the PI 
depends on an assumed fixed variance which does not 
reflect the reality of an age-specific BMI skewness [9]. 
One possibility to overcome these problems would be 
the usage of more sophisticated parametric approaches, 
as for example generalized additive models for location 
scale and shape ("GAMLSS" [10]). GAMLSS are model- 
ling up to four parameters of the conditional response's 
distribution and could therefore take age-specific skew- 
ness into account. This model class has already been 
used for constructing Pis in combination with boosting 
[11]. However, the construction of Pis based on 
GAMLSS depends totally on the assumed distribution 
and the interpretation of covariate effects with respect 
to the interval borders is not straightforward. 

We avoid making distributional assumptions here by 
developing a new approach to constructing non-para- 
metric prediction intervals based on quantile boosting. 
Instead of constructing intervals around a point predic- 
tion, the new approach directly models the interval bor- 
ders by additive quantile regression [12]. The borders 
are fitted as BMI quantiles conditional on the child-spe- 
cific covariate combination. We use quantile boosting 
for the estimation [13], which offers the advantage of 
flexible and inter-pretable covariate effects and an 
intrinsic variable selection property (which is in particu- 
lar useful in high-dimensional data settings). The size of 
the resulting Pis is not fixed but depends on covariates 
- in longitudinal settings it might also depend on child- 
specific effects (corresponding to "random effects" in 
linear and additive mixed models). 

During the work on this paper, we found a severe pit- 
fall in the correct validation of prediction intervals. The 
appropriate measure for validating Pis is conditional 
coverage, not sample coverage (although being more 



intuitive) which makes it unfeasible in almost any data 
setting to evaluate the intervals in practice. The only 
way to demonstrate the correctness of Pis is therefore 
based on an empirical evaluation with simulated data. 
Thus, in a first step we evaluate the correctness of our 
approach in a set of simulation studies before applying 
quantile boosting to predict future BMI values. 

Methods 

Prediction intervals by conditional quantiles 

The idea of using quantile regression to construct pre- 
diction intervals for new observations was presented in 
[12]. In contrast to standard regression analysis, quantile 
regression - thoroughly described in [14] - does not esti- 
mate the conditional expectation E{Y\X = x) = fi(x) of a 
random variable Y but the conditional quantile function 
Q T (Y\X = x) = q T (x) for a given re (0, 1) and a possible 
set of covariates X = x. Following the definition of quan- 
tiles as inverse of the cumulative distribution function, 

cj T (x) =^y|x=x( r )' tne probability of the response Y 
being smaller than q T (x) is r: 

P(Y< q x {x)\X = x)= F Y \ X {q x {x)) = r. (1) 

The goal is therefore to estimate the conditional quan- 
tile function cf x (x) by quantile regression based on a 
training sample (y lf Xi), (y m x n ). For a new observa- 
tion, the specific covariate combination x new is plugged 
into cf T {x new ). A prediction interval for y new is then esti- 
mated by evaluating q T (x new ) at X\ = f and r 2 = 1 — |, 
leading to 

PI(1— a) C^new) = [tfj C^new)/^1 2 (^new)] • (2) 

The resulting PI should cover a new observation y new 
with probability (1 - a) while its length depends on 
x new . There might be combinations of co-variates that 
allow for a very precise prediction for y new resulting in a 
narrow interval, whereas wide intervals imply that for a 
given x new the prediction is more inaccurate. As the esti- 
mates q T (x) depend on a training sample (y lf Xi), (y w 
x n ), which are realizations of random variables Y and X, 
the boundaries of the intervals itself can be seen as ran- 
dom variables. This is an analogy to confidence inter- 
vals, which usually should cover unknown but fixed 
parameters. The boundaries of confidence intervals 
depend on the underlying sample and thus differ from 
sample to sample. Yet, for every sample, they cover the 
true parameter with a probability of 1 - a. Prediction 
intervals are constructed in the same way, but they 
cover a future realization of a random variable, which 
itself is random. The result is that the length of a pre- 
diction interval for y new is always larger than the length 
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of a confidence interval for the expected mean of Y. Pre- 
diction intervals do not only take into account the sam- 
pling error made by the estimation based on a sample, 
but also the unexplained variability of Y given X = x. In 
conclusion, as long as Y\X = x is not deterministic, the 
length of the corresponding PI - in contrast to a confi- 
dence interval - does not reduce to 0, not even for infi- 
nitely large sample sizes. 

Conditional coverage vs. sample coverage 

We stated that a correctly specified prediction interval 
PI(i - a)(*new) covers a new observation y new with prob- 
ability n; : = (1 - a). To validate a method for fitting Pis, 
we obviously need a certain amount of new observa- 
tions: From a single observation (y new , x new ) it is impos- 
sible to verify if PI(! _ a )(^; new ) is correct. It either covers 
y new or not - both events do not prove anything, at least 
if a is not 0. Yet, if we have a certain amount of new 
observations, there still exist two different interpreta- 
tions for the coverage probability m 
Sample coverage 

For any new sample y - {y\,...,y n ) T and corresponding 
covariates x = (xi, ...,# W ) T about (1 - a) - 100% of the 
new sample y will be covered by the n prediction inter- 
vals Pl^i),...,?!^). The coverage refers to the whole 
sample. To evaluate sample coverage in practice, one 
estimates the coverage probability by averaging over dif- 
ferent Pis: 

^ = E(y e PI W ) = ^LlI^ (3) 

n 

where /{♦} is an indicator function. 
Conditional coverage 

For any x new and a corresponding sample (y lf # ne w)>— > 
(jm 0) about (1 - a) • 100% of the observations with 
the particular covariate combination x new will be cov- 
ered by the prediction interval PI(# new ). The coverage 
therefore refers to observations belonging to this x new . 
To evaluate conditional coverage in practice, one esti- 
mates the conditional coverage probability by averaging 
over different new observations for one PI: 

7T |x new = E(Y £ PI(x new )|X = x new ) 

E- =1 J{y^Pifaew)} (4) 

n 

Although sample coverage is the more intuitive inter- 
pretation of Pis, it is obvious that conditional coverage 
reflects in a better way what we really expect from a PI. 
For example, after constructing a 95% PI for the BMI of 
a child at the age of four, given all information available 
from the child as a two-year-old, we particularly expect 
the future BMI of this child with its exact measures to 



be covered with a probability of 95%. In frequentistic 
language, the BMI of 95% of children with exactly the 
same measures should be covered by the interval. The 
coverage should hold for every child and every possible 
combination of covariates not only on average for all 
children. 

Hence, to show the correctness of Pis it is particularly 
not enough to show that Pis cover the right amount of 
observations on average from a new sample. This is 
further illustrated by a small example in Figure 1. For a 
simple univariate regression setting, two different pre- 
diction intervals were fitted: Both hold the sample cov- 
erage, but only one holds the conditional coverage. The 
first one, represented by the blue lines in Figure 1, relies 
on conditional quantiles fitted by linear quantile regres- 
sion. It is an adequate interval for every possible x, it 
holds the conditional coverage and it adapts to the het- 
eroscedasticity found in the data. The second one, 
drawn by red lines, is a "naive" interval, depending on 
the empirical quantiles of the response variable in the 
training sample. It does not take into account the infor- 
mation provided by x and is not adequate regarding the 
conditional coverage for any x. However, it holds the 
sample coverage. This further emphasizes the need to 
be aware of the different concepts of coverage probabil- 
ity and to clarify the precise aims of a PI analysis 
beforehand. 




i i i i i r 
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Figure 1 Example to compare sample coverage and 
conditional coverage. The blue lines represent a prediction 
interval for the response constructed by conditional quantiles. The 
red lines display a "naive" prediction interval constructed by the 
unconditioned empirical quantiles of the response in the training 
sample. 
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This finding leads to a severe problem, at least for 
multivariate prediction settings with several continuous 
covariates: For every combination of covariates only one 
response observation will be available under almost any 
practical circumstances. We will only find one child for 
each combination of covariates - not even twins will 
have the exact same measures - this is obviously not 
enough to verify the correct conditional coverage of a 
fitted PL 

Therefore, to demonstrate the correctness of a method 
fitting accurate prediction intervals, it is necessary to use 
artificial simulated data sets to evaluate the conditional 
coverage in (4) for a selected set of covariate combina- 
tions. Here, we will conduct a simulation study to evalu- 
ate if quantile boosting is a correct method to fit 
accurate conditional prediction intervals in potentially 
high-dimensional data settings before we apply this 
approach to predict future BMI values of children. 

Quantile boosting 

Recall that our aim is to construct Pis based on condi- 
tional quantiles as given in (2). In our approach, we 
determine conditional quantiles by additive quantile 
regression. For a fixed quantile r e (0,1), the conditional 
quantile function is expressed by an additive predictor 
as follows: 



Px{y, rir) 



x • |y- rj r \ 
(1 - r) • \y-rj T \ 



y>7] T 

y < Vt. 



(7) 



(5) 



The index i = 1, denotes the individual, and q T {xi) 
stands for the r-quantile of the response y t conditional 
on its specific covariate vector x t - (xn f x ip ) T . The 
quantile-specific additive predictor r\ xi is composed of 
an intercept /3 r0 and a sum of different effects of p cov- 
ariates Xi = (xa, x ip ) T on the quantile function. The 
functions f Tl , f p comprise linear effects, Le,f T j(Xij) = 
PrjXij, as well as non-linear effects whose functional form 
is not specified in advance. In fact, the additive predictor 
could also contain a wide variety of additional covariate 
effects, e.g. varying coefficient terms or spatial effects, as 
described in [13]. Note that contrary to classical regres- 
sion, there is no specific distributional assumption for 
the response in (5). The only restriction is that the 
response must be continuous. 

In general, the estimation of unknown parameters in 
quantile regression can be achieved by minimizing the 
empirical risk 



1 n 

r\ T = argmin - Y] p T (y if rj ti ), 



(6) 



1=1 



where the check function p T is the appropriate loss 
function for fitting quantiles and can be written as: 



Standard approaches for solving the optimization pro- 
blem in (6) rely on linear programming [14,15]. Quantile 
regression forest [12] is a recent approach to conducting 
quantile regression and adapts random forest [16] to 
estimate the whole conditional distribution function. 
Since this approach is based on regression trees, the 
resulting estimates cj x {x) - in contrast to the additive 
modelling approach presented here - can only be 
described as black-box predictions. Nevertheless, we will 
use quantile regression forest as benchmark in our 
simulation study. 

We will use gradient boosting for the estimation of 
the additive quantile regression model in (5), and call 
our approach quantile boosting in the following. Quan- 
tile boosting [13] was introduced as a method to flexibly 
estimate additive quantile regression models. It is an 
adaptation of component-wise functional gradient des- 
cent boosting [17] and aims at minimizing an empirical 
risk criterion, as given in (6). In case of quantile regres- 
sion, the appropriate loss is the check function (7). 

The minimization of (6) is achieved by stepwise 
updating the predictor function r\ T . Therefore, base-lear- 
ners are used, i.e. simple univariate regression models 
fitting the negative gradient of the empirical loss (7). 
The base-learners play a key role in the algorithm, since 
they define the kind of effects between each covariate 
and response. In our approach, we use simple linear 
models to represent linear covariate effects and pena- 
lized regression splines to represent non-linear effects. 
The advantage of quantile boosting is that the resulting 
predictor r\ T is strictly additive and interpretable, follow- 
ing the additive quantile regression model in (5). 

In detail, the boosting procedure works as follows: For 
each covariate, one specific base-learner is defined and 
in every boosting step the algorithm updates only the 
covariate with the best performing base-learner. This 
way, the algorithm is descending the loss by searching 
in the function space represented by the base-learners. 
If the algorithm is stopped before every base learner was 
at least once updated ("early stopping"), less important 
covariates will never have been updated during the 
boosting process and are effectively excluded from the 
final model. Thus, boosting comes along with an inher- 
ent variable selection property and produces sparse 
models in potentially high-dimensional settings. It even 
allows for candidate models that contain more covari- 
ates than observations. 

Regarding prediction, early stopping is a desirable 
property, since it yields shrunk effect estimates. Shrink- 
age of effect estimates is a widely established method in 
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statistical modelling [18,19] and tends to produce a 
more stable solution leading to an improved prediction 
accuracy of the model [20-22], even though an increase 
of the model bias (towards underlying data) has to be 
accepted. The primary aim is not to minimize the loss 
in the underlying training sample best - resulting in a 
small model bias - but to get accurate predictions with 
a small variance for new data. Since our work focuses 
on predictions for future BMI values, the shrinkage 
effect is of high relevance in our approach and is pro- 
mising in order to provide accurate Pis. 

A crucial parameter that has to be tuned with care 
during the boosting process is the number of stopping 
iterations. It should be tuned regarding the empirical 
loss in (6) on a test data sample, or - in case that no 
additional data is available - by applying cross-validation 
techniques or bootstrapping on the training data [19,23]. 
Quantile boosting is implemented within the R[24] add- 
on package mboost [25,26]. 

Simulation study 

We have already mentioned that the correct empirical 
validation of Pis should be based on conditional cover- 
age. Since it is almost impossible to evaluate the condi- 
tional coverage in practical data analyses, we carried out 
a simulation study to provide some kind of proof that 
Pis fitted by quantile boosting are provided with correct 
conditional coverage. As benchmark, we used quantile 
regression forest [12] for which an implementation is 
available in the R add-on package quantregForest 
[12,27]. 

Our simulation study aims at answering the following 
questions: 

1. Are the proposed Pis able to cover future observa- 
tions with a predefined conditional coverage 
probability? 

2. Is quantile boosting able to identify relevant infor- 
mative covariates, also in high-dimensional settings, 
e.g. data sets with a potentially large number of 
covariates? 

To investigate these questions, we generated artificial 
data from two different settings - a linear setup contain- 
ing only covariates with linear effects on the response: 

P 

Yi = 1.5 — 3xji — 2xi2 + 3xi3 + 5xi4 + Oxy 

+ (1 + ,5Xn + .5Xi2 + .5Xi3 + .5Xi4) • Sf 

and a non-linear setup including also non-linear 
effects: 



y i = 2 + 3sin^-|ij + 1.51og(xi2) 

p 

+ 2.Xi3 — 2Xj4 + ^ ^ OXfj 

i=5 

+ (o.7 + 1.5(xji — 1.5) 2 + .5(xi 2 + xis)j Si. 

The first lines of the model formulas represent the 
contribution of the covariates x 1} x p on the expected 
mean of the response y, whereas the bottom line speci- 
fies their contribution to heteroscedasticity. Both set- 
tings include only four informative covariates X ^ , . . . )X /j,. 
The error terms s t were drawn independent and identi- 
cally from a standard normal distribution, i.e. s t ~ N 
(0,1), whereas the covariates were sampled independent 
and identically from a continuous uniform distribution, 
i.e. x ip ~ £/(0,l) for the linear setup and x a , ...,x ip 

~ 11(0, 3) for the non-linear setup. To evaluate the abil- 
ity of quantile boosting to select relevant covariates, we 
generated data for both settings once in a low-dimen- 
sional scenario with p = 10 and once in a high-dimen- 
sional scenario with p = 500 which, in conclusion, 
included 496 non-informative covariates. 

For each setting, we constructed two-sided 95% Pis 

PIo.95(^new) = [^0.025 (^new)/ ^0.975 (^new)] 

in the following way: We generated in each simulation 
run a training sample (y lf (y w x n ), with n = 2000 

observations and an additional data set with 5000 obser- 
vations to select the optimal number of stopping itera- 
tions for quantile boosting. Then, we fitted additive 
quantile regression models and quantile regression for- 
est for r 1 = 0.025 and r 2 = 0.975, including all p 
covariates. 

In order to evaluate the conditional coverage of the 
resulting Pis, we pre-selected five fixed covari-ate com- 
binations x t with t = 1,..., 5, as test points and thereby 
tried to cover the ^-space. For each of the five test 
points x t , we sampled 10000 test observations y t est\#t 
which served as "future" observations. In analogy to (4), 
we then estimated the conditional coverage of the 
resulting Pis separately for each model and test point by 

^ 10000 

* ]Xt = Toooo ^ 7 ^ testI G ^ 5%(Xt) }' 

i=i 

By designing our simulation in this way, we were able 
to evaluate the conditional coverage of the constructed 
Pis and avoided the pitfall of averaging over a new sam- 
ple, corresponding to the sample coverage. 
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Predicting childhood BMI 
Data 

Data contains observations from a prospective longitudi- 
nal birth cohort study (called "LISA study", [5]), includ- 
ing newborns between 11/1997 and 01/1999 from four 
German cities. Our aim is to predict future BMI values 
for children relying on the data available when they 
were two years old. Originally, the study included 3097 
healthy children - of whom 2007 are complete cases in 
the sense that the necessary covariates at the age of two 
are all available for our analysis and at least one future 
BMI value until the age of ten is recorded. Continuous 
covariates from early childhood are the BMI of the child 
at birth (cBMIO) and as a two-year-old (cBMI2), the 
exact age of the child at the future measurement 
(cAge), the BMI of the mother at the beginning of 
pregnancy (mBMl) and the following BMI gain during 
pregnancy (mDif f BMI). The considered binary categori- 
cal covariates are the sex of the child (cSex), the area 
the child is living in (cArea - rural or urban), exclusive 
breastfeeding until the age of four months (cBreast), 
maternal smoking during pregnancy (mSmoke) and - 
with four covariate levels - the maternal level of educa- 
tion (mEdu - increasing by level). As potential response 
variables, the data comprises BMI values at approxi- 
mately the age of four (cBMI4), six (cBMI6) and ten 
(cBMHO). See [9] for further description of the LISA 
study. 

Cross-sectional analysis 

The aim of our first analysis was to construct prediction 
intervals for future BMI values of individual children at 
approximately the age of four, relying on all information 
available from the child as a two-year-old. Therefore, we 
constructed two-sided 95% Pis with the following addi- 
tive quantile regression model: 

<fr(*i) = PrO + /r(cBMI2i) + ftiCBMIOj 

+ £ T2 cAge f + j8 T3 mBMIi + ^mDiffBMIf 
+ ^ T5 cSexi + £ T6 cAreai + ^cBreast,- 
+ /^gmSmokef + ^ T 9mEdu2f + ^ T i 0 mEdu3i 
+ ^ T nmEdu4i 

Here, q x [%i) denotes the r-quantile of the response 
cBMI4 for child i with covariate combination x t . It will 
represent the borders of child-specific Pis for t x - 0.025 
and r 2 = 0.975. We included a nonlinear effect for 
cBMI2 and linear effects for all other covariates in our 
candidate model. 

As a benchmark, we compared Pis resulting from our 
approach to black box estimates for PIo.95(^new) fr° m 
quantile regression forest. Yet, it was impossible to eval- 
uate the conditional coverage of the Pis in our analysis 
as already discussed above. As a consequence, we 



focused on the empirical loss (6) for model comparison, 
which can be seen as a reliable measure not to validate 
but to compare algorithms fitting Pis by quantile regres- 
sion. Thus, we determined the empirical loss for the 
two quantiles and both models in a 10-fold cross-valida- 
tion analysis. The optimal stopping iteration for quantile 
boosting was selected by 25-fold bootstrapping on each 
of the 10 training data sets separately. Goodness-of-fit 
of the chosen models was assessed by a recent approach 
presented in Wei and He [28], originally developed for 
conditional growth charts. We generated test samples 
from the conditional model distribution and compared 
them to the observed empirical distribution of the 
response, see [28] for details. 
Longitudinal analysis 

In a second step, we tried to explore the longitudinal 
structure of the data at hand and constructed prediction 
intervals for BMI patterns of children until the age 
often, relying again on all information of the child as a 
two-year-old. As response, we now considered individual 
BMI values cBMI^ for child i at three different time 
points t e {1,2,3} corresponding to the age of approxi- 
mately four, six and ten. Note that related applications 
with similar longitudinal settings are the estimation of 
reference growth charts [29] and conditional growth 
charts [28]. We fitted the following additive quantile 
regression model for x\ = 0.025 and r 2 = 0.975: 

4r(*it) = Pro + ha + fr T2l cAge it + /i T (cAge it ) 

+ / 2T (cBMI2i) + £i T cBMI0 f + /3 2r mBMIi 
+ /?3 T mDiffBMIi + jft 4r cSexj + ^5 T cAreaf 
+ ^ 6T cBreastf + ^ 7T mSmokei + /?8rniEdu2i 
+ ^ 9T mEdu3i + ^io T niEdu4i 

This model contains child and quantile specific inter- 
cept b T u and slope b z2 i to account for the correlation 
between repeated measurements of the same child, 
which typically occurs in longitudinal data. These indivi- 
dual-specific "random" effects are estimated by a spe- 
cially designed base-learner employing L x regularization 
methods [30]. In connection with L x regularization, 
quantile regression for longitudinal data was first pro- 
posed by Koenker [31]. Here, we also include indivi- 
dual-specific slopes and smooth non-linear effects in the 
flexible predictor. 

Contrary to the cross-sectional analysis, cAge is 
included and differs for different time points. The non- 
linear fixed effect f lT describes an overall BMI pattern 
depending on age which is valid for all children, whereas 
the random effects b x2i express child-specific linear 
deviations from this overall BMI pattern. All other cov- 
ariates are time-constant. Again, we used the method 



Mayr et al. BMC Medical Research Methodology 2012, 12:6 
http://www.biomedcentral.eom/1 471-2288/1 2/6 



Page 7 of 1 3 



presented in [28] to assess goodness-of-fit, in this case 
separately for the three different time points. 

The optimal stopping iteration for the boosting algo- 
rithm was selected by applying subject-wise bootstrap. 
For this setting, it was impossible to compare quantile 
boosting to the benchmark algorithm, since quantile 
regression forest cannot account for a longitudinal data 
structure. Thus, we only calculated the Pis for BMI pat- 
terns of "new" children by ten-fold cross validation. To 
determine child-specfic Pis, for those children the child- 
specific intercepts and slopes were set to zero, which 
corresponds to their expected mean. 

Results 

Simulation study 

Table 1 shows the resulting mean conditional coverage 
from 100 simulation runs for 95% Pis. Quantile boosting 
clearly outperforms quantile regression forest for both 
setups. Only for the borders of the ^-space (# 4 , x 5 ) in 
the high-dimensional scenario, the Pis fail to cover 95% 
of "future" observations. 

Figure 2 further illustrates the concept of conditional 
coverage. The boxplots display the empirical distribution 
of the "future" observations for each of the test points 
The solid black lines are the true conditional 
quantiles and represent the true borders of a 95% PI for 
each test point. The colored lines show the resulting 
estimated PI borders from 100 simulation runs of the 
two algorithms (quantile boosting on the left in blue, 
quantile regression forest on the right in red). As dis- 
played by Figure 2 (non-linear setup, low-dimensional 
scenario), quantile boosting seems to work best in the 
center of the #-space, which is represented by test point 



Table 1 Results simulation study 



95% Pis 




p = 10 




p = 500 




mboost 


quantregForest 


mboost 


quantregForest 


Linear setup 










7t\X\ 


0.9454 


0.9948 


0.9361 


0.9997 


7t\X2 


0.9489 


0.9689 


0.9425 


0.9889 


TC\X?> 


0.9466 


0.9561 


0.9418 


0.9609 


Tt\X4 


0.9437 


0.9307 


0.9400 


0.9471 


Tt\Xs 


0.9405 


0.9310 


0.9373 


0.9534 


Non-linear 










setup 










7t\X\ 


0.9486 


0.9721 


0.9662 


0.9832 


TC\X2 


0.9494 


0.9925 


0.9623 


0.9961 


TC\X3 


0.9490 


0.9940 


0.9521 


0.9954 


7t\X4 


0.9460 


0.9785 


0.9407 


0.9792 


7t\Xs 


0.9314 


0.8743 


0.9171 


0.8942 



Mean conditional coverage resulting from 95% Pis for both setups and both 
scenarios. In every row, the value of the better performing algorithm (with the 
mean conditional coverage closer to the expected coverage of 95%) for each 
setup is printed in bold. 



x 3 . For the other test points, the standard errors for the 
estimated quantiles get larger, yielding less accurate Pis. 
Quantile regression forest have more problems in fitting 
the correct conditional quantiles, which further explains 
why the resulting Pis fail to achieve the conditional cov- 
erage in Table 1. 

Figure 3 displays the resulting effect estimates from 
100 simulation runs of quantile boosting in the linear 
high-dimensional setup. The blue lines represent the 
quantile-specific true coefficients, which combine the 
effect of the covariate on the expected mean as well as 
on heteroscedasticity. The effect estimates correspond- 
ing to the non-informative covariates are combined in 
the rightmost boxplot. Therefore, Figure 3 illustrates the 
ability of the algorithm to select relevant covariates 
while intrinsically incorporating shrinkage of effect 
estimates. 

In conclusion, Pis fitted by quantile boosting seem to 
cover future observations with the predefined coverage 
probability, conditional on the test points. The best 
results can be observed in the center of the #-grid. 
Quantile boosting outperforms the benchmark in both 
setups - linear and nonlinear setup - and for both sce- 
narios - for the low-dimensional as well as for the high- 
dimensional scenario. However, the evaluated simulation 
setups did not include interaction terms - which could 
have favored quantile regression forest. For our data 
analysis, we can rely on the result that Pis constructed 
by quantile regression lead to correct conditional cover- 
age probabilities. Furthermore, we can benefit from 
quantile boosting since the algorithm is able to select 
relevant covariates and yields sparse models in high- 
dimensional scenarios. 

Predicting childhood BMI 
Data 

To get a first impression of the data at hand, Figure 4 
shows the empirical BMI distribution depending on age. 
It illustrates the age-specific skewness of the BMI distri- 
bution, beginning somewhere after the age of six, as 
well as the longitudinal data structure with repeated 
BMI observations per child at birth and around the age 
of 2, 4, 6, and 10. 
Cross-sectional analysis 

In our first cross-sectional analysis, we ignore the longi- 
tudinal character of the data and fit 95% Pis for the 
BMI around the age of four with data available from the 
children as two-year-olds, by both quantile boosting and 
quantile regression forest. Figure 5 shows the resulting 
Pis for six randomly chosen children (that were left out 
in the fitting process) emphasizing that length and level 
of the resulting Pis are in fact child-specific. The mean 
length of the Pis for all children is 3.55 kg/m 2 , while the 
lengths of the Pis range from 2.81 kg/m 2 to 5.62 kg/m 2 . 



Mayr et al. BMC Medical Research Methodology 2012, 12:6 
http://www.biomedcentral.eom/1 471-2288/1 2/6 



Page 8 of 1 3 



20 - 



10 - 



-10 - 



estimated boundaries of Pis 

truequantile 



- B 



B B 



I 

B 



20 - 



10 - 



-10 - 



estimated boundaries of Pis 

truequantile 



- B 



H B 



I 

B 



X 2 



x 3 



x 4 



x 5 



x 2 



x 3 



x 4 



x 5 



c;«,,^o x test x test 

Figure 2 Results from the non-linear setup in the low-dimensional scenario: Boxplots represent the distribution of y test . Black solid lines 
show the true conditional quantiles, while the colored lines represent the fitted conditional quantiles from 100 simulation runs for quantile 
boosting (left) and quantile regression forest (right). 



Thus, the BMI prediction for some children is more 
precise than for others. 

Table 2 contains the estimated covariate effects for 
quantile boosting. It can be observed that not all covari- 
ates are selected during the boosting process, which 
again reflects the variable selection property of quantile 
boosting. For the 2.5% BMI quan-tile, cSex, cBreast 
and mEdu are excluded from the model, whereas for the 
97.5% BMI quantile cBMIO, mDif f BMI and cBreast 
are excluded. For both quantiles, mBMI and cArea are 
chosen with similar effects. This can be interpreted as 
follows with respect to the Pis: Both PI borders for a 
child from a urban area, for example, are shifted by 
-0.029 kg/m 2 and -0.075 kg/m 2 compared to the PI bor- 
ders for a child from an rural area. Interestingly, the 
effect of mSmoke has different signs for different quan- 
tiles, meaning that the effect of maternal smoking dur- 
ing pregnancy seems to be negative for lower BMI 
quantiles and positive for upper BMI quantiles. This 
results in a wider PI for a child whose mother smoked 
during pregnancy. The estimated non-linear effects of 
cBMI2 are presented in the Additional file 1, Figure SI. 

At first glance, the Pis resulting from quantile regres- 
sion forest - the red-colored Pis in Figure 5 - seem to 
be very similar to those from quantile boosting: the 
mean length is 3.48 kg/m 2 , ranging from 2.46 kg/m 2 to 
6.62 kg/m 2 . We also conducted a quantitative compari- 
son between the algorithms by 10-fold cross-validation. 
Figure 6 displays the empirical loss distributions of the 



estimated quantiles on children iteratively left out in the 
fitting process. These results suggest that quantile boost- 
ing outperforms quantile regression forest with respect 
to accuracy in the estimation of the PI borders 
^0.025 (*new) an d ^o.975 (*new) • This result is further sup- 
ported by the goodness-of-fit diagnostic plots in the 
additional files (Additional File 1, Figure S2). The plots 
refer to the underlying models which were used to esti- 
mate the borders of the Pis and show a slightly 
improved goodness-of-fit for quantile boosting. Even 
though we cannot check the conditional coverage of our 
Pis here, we rely on the findings from the simulation 
study and conclude that quantile boosting does not only 
provide Pis with interpretable additive effects, but also 
yields more accurate predictions than quantile regres- 
sion forest. 
Longitudinal analysis 

In a second step, we used all information of the children 
at the age of two to predict their BMI patterns until the 
age of ten. Therefore, we included child-specific inter- 
cepts and slopes in the quantile boosting approach. Fig- 
ure 7 shows the resulting Pis for six randomly chosen 
children. 

Again, level and length of the Pis are child-specific, 
but the lengths of Pis at the age of ten are larger than 
the lengths at earlier time points. This seems to be rea- 
listic as we try to predict BMI values of children at the 
age of ten, only relying on information available as two- 
year-olds. The mean length of the Pis of all children is 
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Figure 3 Results from the linear setup, high-dimensional scenario, quantile boosting: Boxplots display the empirical distribution of 
the estimated coefficients for q 002 5 (left) and q 0975 (right), obtained from 100 simulation runs. The blue lines represent the underlying 
true coefficients without shrinkage. 



4.78 kg/m 2 , ranging from 2.52 kg/m 2 to 11.28 kg/m 2 . 



The 



increased length of the intervals again results from the 
children getting older. This result is further emphasized 
by the estimated non-linear effects of cAge (presented as 
Figure S3 in the Additional file 1). The estimated effect 
for the 97.5% BMI quantile, i.e. the upper border of the 
Pis, is strongly increasing after the age of six, whereas the 



effect for the lower border remains constant. This result 
also corresponds to the empirical age-specific BMI distri- 
bution observed in Figure 4. Apparently the resulting Pis 
reflect the risk of childhood obesity kicking-in some- 
where after the age of six. 

Effect estimates for other covariates are included in 
Table 2. The pattern of selected covariates roughly 
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Figure 4 Empirical BMI distribution in the LISA study depending on age. The blue dashed lines represent the empirical quantile curves, i.e. 
£?o:025 and ^975 respectively, whereas the black solid line corresponds to the median. 
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Figure 5 Observed BMI patterns and resulting Pis from the cross-sectional analysis. Blue intervals were fitted by quantile boosting, red 
ones by quantile regression forest. The six randomly selected children were part of the cases left out from the fitting process in the cross- 
validation analysis. 



Table 2 Linear effect estimates for the LISA study: 
quantile boosting 



Cross-sectional analysis Longitudinal analysis 
Variable r = 0.025 r = 0.975 r = 0.025 x = 0.975 



Intercept 


14.208 


14.867 


14.627 


12.723 


cAge 






f(-) 


f(-) 


CBMI2 


m 




ft) 


f(-) 


cBMIO 


0.008 








mBMI 


0.028 


0.034 


0.029 


0.132 


mDif f BMI 


0.026 








cSex = male 




0.068 






cArea = urban 


-0.029 


-0.075 




-0.043 


cBreast = yes 










mSmoke = yes 


-0.228 


0.296 




0.158 


mEdu = 1 (low) 




0.162 




0.162 


mEdu = 2 




0.406 




0.176 


mEdu = 3 




0.130 




-0.107 


mEdu = 4 (high) 




0.070 




-0.092 



Resulting effect estimates for the borders of 95% Pis with the quantile 
boosting approach. Only effects of selected variables are displayed. Non-linear 
effect estimates are presented as Figure S1 and Figure S3 in Additional file 1. 



corresponds to the cross-sectional analysis. Even though 
the effect signs and sizes show minor differences for 
some covariates, such as mEdu, the other effects on the 
PI borders remain stable across analyses, including the 
non-linear effect of cBMI2 (Additional file 1, Figure S3), 
confirming the presence of these effects. Diagnostic 
plots (Additional file 1, Figure S4) show a satisfying 
goodness-of-fit of the underlying models for the ages of 
four and six. Poorer results are obtained for the age of 
ten, which reflects the limited information available for 
this long-term prediction. 

Discussion 

The aim of the present work was to construct prediction 
intervals for future BMI values of individual children. 
We pursued this aim by applying quantile boosting - a 
boosting approach estimating additive quantile regres- 
sion models - to directly model the borders of the Pis. 
As a result, we do not rely on any distributional 
assumptions. 

A main advantage of Pis fitted by quantile boosting is 
that we can directly interpret the estimated effects with 
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Figure 6 Empirical loss distributions of the estimated quantiles 
on children iteratively left out in the fitting process in the 10- 
fold cross-validation. Black points are the mean empirical loss 
from the 10 folds for every quantile and fitting method separately. 
Lower values indicate a more precise estimation of the 
corresponding quantile. 



regard to the interval borders. From the results of the 
cross-sectional analysis, for example, it follows that chil- 
dren whose mothers smoked during pregnancy have lar- 
ger estimated Pis than other children. These 
conclusions could not have been drawn from quantile 
regression forest, an alternative approach to fitting non- 
parametric Pis, which leads to black box estimates. 

The results of our simulation study suggest that quan- 
tile boosting outperforms quantile regression forest with 
respect to conditional coverage - which in our view is 
the key performance measure to evaluate Pis correctly. 
However, it is generally not possible to check condi- 
tional coverage in practical applications. In our data 
analyses, we thus had to rely on the findings from the 
simulation study. These findings were supported by the 
results of a formal comparison of empirical risks in the 
cross-sectional analysis, suggesting that quantile boost- 
ing provided more accurate predictions than quantile 
regression forest. 

We could also benefit from the inherent shrinkage 
and variable selection properties of boosting in our ana- 
lysis. Only a limited number of covariates was selected 
by the boosting algorithm, leading to sparse models. 
Note that it would even be possible to apply quantile 
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Figure 7 BMI patterns and resulting Pis from the longitudinal analysis. The six randomly selected children were part of the cases left out 
from the fitting process in the cross-validation analysis 
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boosting to data sets with more co-variates than obser- 
vations, i.e., in high-dimensional data settings. A limita- 
tion coming along with the shrinkage property is the 
absence of standard errors estimations for the effect 
estimates. As a result, we cannot compute statistical 
tests regarding the effects of covariates, e.g. report infor- 
mation about their significance. Although researchers in 
practice often feel uncomfortable in the absence of p- 
values, we think that this limitation is acceptable here, 
as the focus is directed towards getting reliable 
predictions. 

The resulting Pis of the longitudinal analysis empha- 
size further strengths of quantile boosting for fitting Pis. 
Relying on data available of the children as two-year- 
olds, we can fit accurate and child-specific Pis not only 
for BMI values around the age of four, but also for BMI 
patterns until the age of ten. Quantile boosting allows 
to explore longitudinal data structures by including indi- 
vidual-specific "random" effects, emphasizing the child- 
specific character for the resulting Pis. Here, we could 
observe that the lengths of the intervals strongly 
increase with the age of the children. From a methodo- 
logical view, this absolutely reflects what we should 
expect from a valid method to fit Pis: The intervals do 
what they should, in reporting the increasing uncertainty 
in the prediction of BMI values until the age of ten 
based only on very limited information from the chil- 
dren in early childhood. 

The lack of covariates explaining physical activity, 
nutrition and lifestyle habits of the children is of course 
a further limitation of the presented work. It would be 
interesting to see if this information could help for get- 
ting more precise predictions as presented in this paper. 

Conclusion 

In conclusion, we think that quantile boosting is a pro- 
mising approach to construct prediction intervals with 
correct conditional coverage in a non-parametric way. It 
can be applied to longitudinal settings and is therefore 
in particular suitable for the prediction of BMI patterns 
or similar data, where assumptions of standard para- 
metric approaches are not fulfilled. 

Additional material 



the simulated conditional distribution to the real ones. Blue points and 
bars refer to the results of quantile boosting whereas red points and bars 
refer to those from quantile regression forest. Figure S3 Resulting 
estimates for the the non-linear partial effect of the BMI at the age of 
two (left) and the age of the child (right) on the Pis for childhood BMI 
patterns. The lines represent the partial effect on q 0 . 025 and q 0:975 
respectively as the borders of a 95% PI in the longitudinal analysis. Figure 
54 Goodness-of-fit diagnostic plots according to [28] for the underlying 
models from the longitudinal analysis (BMI of children at the ages of 
four, six and ten). Separately for the three different time points, test 
observations were simulated from the conditional model distribution and 
compared to the empirical distribution of the response observations in 
QQ-plots (first row). Barplots (second row) show the standardized 
deviation of quantiles from the simulated conditional distribution to the 
real ones. 
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